This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday October 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

# Import required R libraries
library(fpp3)
library(tidyverse)
library(readxl)
library(seasonal)
library(stringr)

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")

# Properly convert the DATE column to match true input
# https://stackoverflow.com/questions/43230470/how-to-convert-excel-date-format-to-proper-date-in-r
atm_data_raw <- atm_data_raw %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

# Initial output to see data
#head(atm_data_raw)

# Output summary for high level assessment
summary(atm_data_raw)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19
# Check dimensions to understand breadth of data
dim(atm_data_raw)
## [1] 1474    3
# 1474    3
#Cash in hundreds

Dates of dataset start at 2009-05-01 and end with 2010-05-14. This indicates 379 dates, which is 14 more than a single year of 365 days.

Dimensions of the initial dataset indicate 1474 observations, which again indicates more than a single year’s worth of data. The 3 columns are DATE, ATM, and Cash. The DATE indicates the specific data, the ATM indicates which of the 4 ATMs, and Cash represents the total amount withdrawn for the given date and ATM.

Cash 19 NA’s

Data observations: ATM1: 3 Cash NA values ATM2: 2 Cash NA values ATM3: Only 3 Cash values with something above zero ATM4: Many Cash values with a decimal, but not all, something weird there, also ATM4 appears to have 1 really crazy outlier

Total by day and see if there is an overall relationship to be understood

# Define as tsibble with DATE as index
atm_data_ts <- atm_data_raw %>%
  as_tsibble(index = DATE, key = ATM)

# Output tsibble to confirm proper format and definition
atm_data_ts %>%
  filter(DATE > "2010-04-30")
## # A tsibble: 14 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA

A closer look at the data shows 14 observations starting with date 2010-05-01 and ending on 2010-05-14 do not indicate an ATM nor a Cash amount, thus these 14 observations will be ignored in the upcoming evaluation. The removal of thse 14 observations also makes for a cleaner dataset, as now the total observations is 1460, which is exacting 365 for each of the 4 ATMs.

Plot all data

atm_data_ts %>%
  autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawls")

TODO: Comment here

ATM1

# Filter to ATM1 data
atm1_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM1')

# Output summary
summary(atm1_data_ts)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.89  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00  
##                                          NA's   :3

The summary of data for ATM1 confirms the 365 days (single year) of observations and also indicates 3 missing Cash values.

atm1_data_ts$DATE[is.na(atm1_data_ts$Cash)]
## [1] "2009-06-13" "2009-06-16" "2009-06-22"

The 3 dats with missing Cash values are displayed above. Not sure the significance of the missing data occurring in a small window of time, so will consider imputation.

atm1_data_ts %>%
  autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM1")

Initial plot of the ATM1 data appears to show a weekly seasonal component along with a dip between October 2009 and January 2010. There does not appear to be a clear increasing or decreasing trend.

Impute Missing Data

Given the low volume of missing values, three, I will simply impute the data with the median value of the ATM1 dataset.

# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median

summary(atm1_data_ts)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.95  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00

Summary output confirms no missing data for ATM1 Cash amounts.

Simply updating the ATM1 data tsibble is defined properly.

atm1_data_ts <- atm1_data_ts %>%
  mutate(DATE = as_date(DATE)) %>%
  update_tsibble(index = DATE)
#atm1_data_ts

Decomposition

To better understand the trend and seasonal nature of the ATM1 dataset, decomposition is performed.

# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm1_dcmp %>% components() %>% autoplot()

STL decomposition shows the remainder plot has the most impact on the data based on the gray bar on the left. The seasonal plot shows a clear seasonal pattern, but while the seasonal pattern remains the same, the remainder plot shows greater variance toward the end of plot, beginning in February 2010. The trend line confirms no real trend present in the dataset.

components(atm1_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

The above seasonally adjusted trendline confirms that despite a few outliers through December 2009, the seasonal component does well to address the weekly nature of the data, but the plot above indicates starting in February 2010, the seasonal component does not properly define the dataset. Based on the above plot, a new weekly pattern appears to emerge in February 2010.

components(atm1_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

Above plot shows that the seasonally adjusted data does not account for the weekly seasonal nature for the last three months. Clearly a shift in the weekly seasonal pattern has occurred.

atm1_data_ts %>% ACF() %>% autoplot()

The ACF of the ATM1 data shows a clear correlation at lags 7, 14, and 21, which is expected for seasonal data following a weekly pattern.

atm1_data_ts %>% PACF() %>% autoplot()

The PACF plot re-confirms the correlations at lag 7 and 14. Interesting, correlation also appears at lags 2 and 5.

atm1_data_ts %>% gg_season(Cash, period = "week") +
  labs(y="$USD (in hundreds)", title="ATM Withdrawals")

To better understand the weekly seasonal pattern, the gg_season plot above does show some sort of shift on Tuesday and Thursday.

atm1_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$USD (in hundreds)",
    title = "ATM Withdrawals"
  )

The `gg_subseries plot confirms the shift in February occurs on Tuesday, Wednesday, and Thursday. The other days of the week indicate some variance and outliers, but nothing as dramatic as Tuesday through Thursday.

Train and Test Model

Before forecasting the data for May 2010, I want to train and test the different model functions for proper evaluation. The models are trained on 292 days, or approximately 80% of the year represented.

# Create training set (assume 1 year of data)
# 292 days for training, approx. 80% of data
train_atm1 <- atm1_data_ts %>% 
  filter_index("2009-05-01" ~ "2010-02-17")

#train_atm1

# Fit the models
fit_atm1 <- train_atm1 %>%
  model(
    Naive = NAIVE(Cash),
    `Seasonal naive` = SNAIVE(Cash),
    `Random walk` = RW(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

fit_atm1 %>% accuracy()
## # A tibble: 7 × 11
##   ATM   .model         .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>          <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM1  Naive          Training 0.284  49.9  37.4 -85.0 120.  2.07  1.77  -0.356  
## 2 ATM1  Seasonal naive Training 0.203  28.1  18.0 -83.3 103.  1     1      0.0957 
## 3 ATM1  Random walk    Training 0.284  49.9  37.4 -85.0 120.  2.07  1.77  -0.356  
## 4 ATM1  Arima          Training 0.179  21.9  14.3 -66.5  82.7 0.790 0.780  0.00341
## 5 ATM1  ETS_Add        Training 0.175  21.4  13.7 -73.3  89.7 0.761 0.763  0.0940 
## 6 ATM1  ETS_Mult       Training 1.98   21.6  13.7 -75.8  90.6 0.760 0.769  0.0956 
## 7 ATM1  ETS_Damp       Training 1.46   21.3  13.3 -76.4  90.4 0.737 0.759  0.0756

Using the different model functions from the Hyndman textbook, the best performing model based on RMSE is the ETS model with dampening. Overall, the three ETS models and the ARIMA model all perform well in comparison. The seasonal naive model also performe well with an RMSE slightly higher than the ARIMA model and the three ETS models.

# Generate forecasts for 72 days
fc_atm1 <- fit_atm1 %>% forecast(h = "72 days")

fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 7 × 11
##   .model         ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM1  Test   -2.20  50.8  37.8 -466.  502.  2.09  1.81 -0.0565 
## 2 ETS_Add        ATM1  Test   -3.25  52.4  38.0 -479.  514.  2.11  1.86  0.00348
## 3 ETS_Damp       ATM1  Test  -10.4   54.2  40.0 -517.  547.  2.22  1.93 -0.0246 
## 4 ETS_Mult       ATM1  Test   -7.03  52.9  39.2 -494.  527.  2.17  1.88 -0.00808
## 5 Naive          ATM1  Test  -98.2  104.   98.2 -893.  893.  5.44  3.71 -0.0585 
## 6 Random walk    ATM1  Test  -98.2  104.   98.2 -893.  893.  5.44  3.71 -0.0585 
## 7 Seasonal naive ATM1  Test   -5.21  64.1  53.7 -460.  509.  2.97  2.28  0.00878

The accuracy of the models on the test dataset shows ARIMA performing the best with an RMSE of 50.77. The three ETS models do also perform well.

# Plot forecasts against actual values
fc_atm1 %>%
  autoplot(filter_index(train_atm1, "2010-02-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2010-02-10" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

So plotting the seven forecasts along with the original dataset shows the seasonal pattern of the forecasts does not match the seasonal pattern of the actual dataset. Based on the observations in the decomposition section, the weekly pattern has changed, and thus training the model on the first 80% of the data actually misses the new pattern found in the data. Further investigation is needed.

fit_atm1_arima <- train_atm1 %>%
  model(ARIMA(Cash))

# Check the residuals of Seasonal Naive
fit_atm1_arima %>% gg_tsresiduals()

To further confirm the assumption of a bad model, I’ve taken the ARIMA model, the best performing based on RMSE, and displayed the residuals above. The ACF plot of the residuals clearly indicates a correlation exists at lags 8 and 20. This indicates the model isn’t quite capturing the correlation of the test data properly.

Train and Test Model: Second Attempt

Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. Yes, this certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.

# ATM1 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm1 <- atm1_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-30")

train_atm1 <- atm1_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-14")

# Fit the models
fit_atm1 <- train_atm1 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm1)
## # A tibble: 5 × 12
##   ATM   .model  sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots    MSE   AMSE
##   <chr> <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>    <dbl>  <dbl>
## 1 ATM1  Seaso… 575.        NA    NA    NA    NA  <NULL>   <NULL>      NA     NA 
## 2 ATM1  Arima  311.      -219.  447.  448.  455. <cpl [9… <cpl [0…    NA     NA 
## 3 ATM1  ETS_A… 327.      -281.  582.  586.  602. <NULL>   <NULL>     276.   330.
## 4 ATM1  ETS_M…   0.155   -308.  636.  641.  657. <NULL>   <NULL>    1046.  1437.
## 5 ATM1  ETS_D…   0.405   -329.  683.  692.  710. <NULL>   <NULL>   27524. 41523.
## # … with 1 more variable: MAE <dbl>

Now using only the 5 top performing models due to the seasonal nature of the data, the ARIMA model performs the best of the 4 models by evaluating the AICc of 447.

fit_atm1 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model         .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>          <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM1  Seasonal naive Trai…  -5.20   24.3  15.9 -214.  230.  1     1      0.444 
## 2 ATM1  Arima          Trai…   0.138  16.0  11.9  -97.6 143.  0.753 0.660 -0.127 
## 3 ATM1  ETS_Add        Trai…  -1.19   16.6  11.7  -56.5  68.7 0.739 0.684  0.0722
## 4 ATM1  ETS_Mult       Trai…  -2.40   32.3  20.8 -125.  145.  1.31  1.33   0.480 
## 5 ATM1  ETS_Damp       Trai… -10.4   166.   61.7  -13.8  68.2 3.89  6.82   0.412

The ARIMA model performs the best on the training set with an RMSE of 16.05, while the ETS Additive performs almost as well as the ARIMA model. The Seasonal Naive model performs third best and the ETS with dampening is clearly not doing well compared to the other four models.

# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm1 <- fit_atm1 %>% forecast(h = "16 days")

fc_atm1 %>% accuracy(new_seasonal_atm1)
## # A tibble: 5 × 11
##   .model         ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM1  Test   7.71  12.8   9.50 40.9  43.2  0.599 0.527  0.0154 
## 2 ETS_Add        ATM1  Test   1.06  11.8   8.88  5.50 15.3  0.560 0.486 -0.337  
## 3 ETS_Damp       ATM1  Test  46.6   54.0  46.6  57.3  58.4  2.94  2.22   0.0345 
## 4 ETS_Mult       ATM1  Test   9.17  15.1  11.0  -2.95 26.2  0.694 0.621 -0.161  
## 5 Seasonal naive ATM1  Test   0.125  9.64  7.12  1.26  9.83 0.449 0.397 -0.00791

For the accuracy of the five models on the test set, the Seasonal Naive model actually performs the best with an RMSE of 9.64, while ARIMA and ETS Additive, also perform well.

# Plot forecasts against actual values
fc_atm1 %>%
  autoplot(filter_index(train_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset.

fit_atm1_arima<- train_atm1 %>%
  model(ARIMA(Cash))

# Check the residuals of ARIMA
fit_atm1_arima %>% gg_tsresiduals()

As above, I’ve displayed the residuals of the ARIMA model to assess the appropriateness of the model. The plot of Innovation residuals appears to follow white noise after the first 10 observations or so. The ACF plot indicates no correlation outside of the confidence intervals.

# Consider Box-Cox
lambda <- new_seasonal_atm1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

new_seasonal_atm1 %>%
  autoplot(box_cox(Cash, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM1 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

Applying Box-Cox to see if I can squeeze out a little better performance on the test data.

# Forecast for May 2010
fit_atm1_bc <- train_atm1 %>%
  model(
    `Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
    ARIMA(box_cox(Cash, lambda)),
    ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A"))
  )

report(fit_atm1_bc)
## # A tibble: 3 × 12
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots   MSE  AMSE
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>   <dbl> <dbl>
## 1 ATM1  Seasonal…   323.     NA    NA    NA    NA  <NULL>   <NULL>     NA    NA 
## 2 ATM1  ARIMA(bo…   175.   -206.  417.  418.  423. <cpl [1… <cpl [7…   NA    NA 
## 3 ATM1  ETS_Add     182.   -264.  548.  552.  568. <NULL>   <NULL>    154.  185.
## # … with 1 more variable: MAE <dbl>

The AICc of the ARIMA model with Box-Cox transformation has improved from 447.73 to 417.98. A promising sign.

fit_atm1_bc %>% accuracy()
## # A tibble: 3 × 11
##   ATM   .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM1  Seasonal naive Train… -5.20  24.3  15.9 -214.  230.  1     1      0.444 
## 2 ATM1  ARIMA(box_cox… Train… -2.56  16.1  12.3 -105.  127.  0.773 0.661 -0.313 
## 3 ATM1  ETS_Add        Train… -1.11  16.6  12.0  -66.2  77.5 0.755 0.682  0.0836

The accuracy of the models with Box-Cox show little to no improvement over the models without transformations on the training set. The RMSE of the ARIMA model with the transformation actually increases from 16.04555 to 16.06325 as noted above.

# Generate forecasts for the next 16 days
fc_atm1_bc <- fit_atm1_bc %>% forecast(h = "16 days")

fc_atm1_bc %>% accuracy(new_seasonal_atm1)
## # A tibble: 3 × 11
##   .model          ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>           <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA(box_cox(… ATM1  Test  -6.25  11.5   9.35 -32.6  36.2 0.589 0.474 -0.360 
## 2 ETS_Add         ATM1  Test  -0.247 11.8   9.24 -24.8  35.4 0.582 0.487 -0.398 
## 3 Seasonal naive  ATM1  Test  -1.05   9.87  7.88 -21.8  29.4 0.497 0.406 -0.0900

The RMSE of the ARIMA model with Box-Cox transformation on the test data does improve the RMSE compared to the ARIMA model without transformation from 12.812248 to 11.53609.

# Plot forecasts against actual values
fc_atm1_bc %>%
  autoplot(filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The above plot shows the forecasts of the models with transformation along with the actual data. Similar to the plot from the models with transformations, the forecasts do follow the seasonal pattern of the data.

ATM1 Conclusion

Given the assignment is to forecast the data for May 2010 from the provided data, I believe the models generated from the final two and a half months is the appropriate approach. Clearly, a shift in the weekly pattern occurs in February 2010 and without any additional context or history, I assume that shift will continue through May 2010. As the forecast is only 1 month or approximately four more weeks from the end of the provided data, then I choose to believe that new pattern will persist for the next four weeks through May 2010.

ATM2

atm2_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM2')

atm2_data_ts %>%  autoplot(Cash)

# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median

#atm2_data_ts
atm2_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
        ) %>% 
  components() %>% autoplot()

Same thing as ATM1, the seasonal nature of the data changes in the final few months. I’m gonna separate the forecast into just the final section where the weekly seasonal nature shifts.

atm2_data_ts %>% ACF() %>% autoplot()

atm2_data_ts %>% PACF() %>% autoplot()

atm2_data_ts %>% gg_season(Cash, period = "week") +
  theme(legend.position = "none") +
  labs(y="$ (in hundreds)", title="ATM Withdrawals")

atm2_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$ (in hundreds)",
    title = "ATM Withdrawals"
  )

atm2_dcmp <- atm2_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm2_dcmp %>% components() %>% autoplot()

components(atm2_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

components(atm2_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

ATM3

Now, for ATM3.

atm3_data_ts_all <- atm_data_ts %>%
  filter(ATM == 'ATM3')

summary(atm3_data_ts_all)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:365         Min.   : 0.0000  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 0.0000  
##  Median :2009-10-30   Mode  :character   Median : 0.0000  
##  Mean   :2009-10-30                      Mean   : 0.7206  
##  3rd Qu.:2010-01-29                      3rd Qu.: 0.0000  
##  Max.   :2010-04-30                      Max.   :96.0000

From summary output, the date range (2009-05-01 to 2010-04-03) matches expectations given the data from ATM1 and ATM2. The Cash column does not contain any missing data but does indicate curious aspects. The range is 0 to 96 which might be odd to have 0, but the median is also 0, which certainly indicates a curiosity about the data. The mean is also quite low, less than 1.

atm3_data_ts_all %>% autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")

A simply plot of the data clearly shows unexpected information. Only a few days at the end of the dataset contain anything above zero.

Remove data

atm3_data_ts <- atm3_data_ts_all %>%
  filter(ATM == 'ATM3', Cash > 0)

atm3_data_ts
## # A tsibble: 3 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

By removing every date with a value equal to 0, only three dates remain - the final three dates of the dataset.

summary(atm3_data_ts)
##       DATE                ATM                 Cash      
##  Min.   :2010-04-28   Length:3           Min.   :82.00  
##  1st Qu.:2010-04-28   Class :character   1st Qu.:83.50  
##  Median :2010-04-29   Mode  :character   Median :85.00  
##  Mean   :2010-04-29                      Mean   :87.67  
##  3rd Qu.:2010-04-29                      3rd Qu.:90.50  
##  Max.   :2010-04-30                      Max.   :96.00

For those three observations, the range is 82 to 96 with a mean of 87.67.

atm3_data_ts %>% autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")

The plot confirms that only three dates contain a Cash amount greater than zero.

ATM3 Model

A decision much be made in how to approach this dataset. Of 365 days, 362 contain a Cash amount of zero and the final three dates contain a Cash amount meeting expectations. Without history or context of ATM3, I will make the assumption that ATM3 was broken or in some way not usable for everyday with a Cash amount of zero. My assumption would follow that starting on 2010-04-28 ATM3 was fixed or made active, and thus can be assumed working and active for the month of May 2010.

With so little data to build a model, I will choose the MEAN function based on the three dates containing Cash amounts above zero.

fit_atm3 <- atm3_data_ts %>% model(MEAN(Cash))

report(fit_atm3)
## Series: Cash 
## Model: MEAN 
## 
## Mean: 87.6667 
## sigma^2: 54.3333

The model returns an expected value of 87.67 as the mean, which was indicated in the summary of the data.

# Forecast for 31 days of May 2010
fc_atm3 <- fit_atm3 %>% forecast(h = "31 days")

fc_atm3 %>%
  autoplot(filter_index(atm3_data_ts_all, "2009-05-01" ~ "2010-05-31"), level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2010-04-28" ~ "2010-05-31"),
    colour = "black"
  ) +
  labs(y="Cash (in hundreds $USD)", title="ATM3 Withdrawals with Forecast") +
  guides(colour = guide_legend(title = "Forecast"))

Not a very interesting plot of the forecasted data in relation to the ATM3 dataset, but given the minimal amount of data, the mean forecast is a constant line for the month of May 2010.

ATM3 Conclusion

There’s only 3 values with data above zero, so I’m assuming there was absolutely no activity but the ATM3 existed, or else it would have NA.

ATM4

atm4_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM4')

summary(atm4_data_ts)
##       DATE                ATM                 Cash          
##  Min.   :2009-05-01   Length:365         Min.   :    1.563  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:  124.334  
##  Median :2009-10-30   Mode  :character   Median :  403.839  
##  Mean   :2009-10-30                      Mean   :  474.043  
##  3rd Qu.:2010-01-29                      3rd Qu.:  704.507  
##  Max.   :2010-04-30                      Max.   :10919.762
#atm4_data_ts

As expected, the summary for the ATM4 dataset falls on the dates 2009-05-01 through 2010-04-30. The Cash column indicats no missing values. The range of the Cash column is 1.563 to 10919.762 with a mean of 474.043. The low value is certainly low, but thankfully not zero, as addressed in ATM3. The high value is quite high compared to the mean and the third quartile. Perhaps an outlier or a few exists in the ATM4 data.

atm4_data_ts %>% autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

The plot of the ATM4 data shows a clear outlier in February 2010.

atm4_data_ts %>%
  filter(Cash > 3000)
## # A tsibble: 1 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM     Cash
##   <date>     <chr>  <dbl>
## 1 2010-02-09 ATM4  10920.

The day of the outlier turns out to be 2010-02-09.

Decomposition

atm4_dcmp <- atm4_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm4_dcmp %>% components() %>% autoplot()

Given the weekly seasonal pattern worked for ATM1 and ATM2, let’s try decomposition on ATM4. Given the plot above, the outlier makes the plots near impossible to evaluate.

Impute outlier and missing data

Let’s the remove the outlier by imputing the median value in its place.

# Calculate median value for ATM
# Straightforward approach to impute data
median <- median(atm4_data_ts$Cash, na.rm=TRUE)

# Get rid of outlier from ATM4
atm4_data_ts$Cash[atm4_data_ts$Cash > 3000] <- median 

atm4_data_ts %>% autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

By removing the outlier, the data does appear more approachable. Overall, I don’t detect a clear trend, perhaps a slight decrease. Given the data represents a single year, I don’t expect to find a cyclic component, but perhaps a weekly seasonal component exists. Let’s try the decomposition again, now excluding the outlier.

atm4_dcmp <- atm4_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))

atm4_dcmp %>% components() %>% autoplot()

The remainder component does play a larger factor than the seasonal component. As identified in ATM1 and ATM2, the trend component has minimal impact. Let’s try the seasonally adjusted plot.

components(atm4_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

The seasonally-adjusted trendline plot above does not convey any clear weekly pattern.

atm4_data_ts %>% ACF() %>% autoplot()

The ACF plot indicates correlation at lags 7, 14, and 21 as expected. Also, correlation appears at lag 10.

atm4_data_ts %>% PACF() %>% autoplot()

The PACF shows correlation at lags 7, 10, and 19.

atm4_data_ts %>% gg_season(Cash, period = "week") +
  labs(y="$ (in hundreds)", title="ATM4 Withdrawals")

I’ll admit the above plot of gg_season does not provide much value except perhaps for some nice colors.

atm4_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$ (in hundreds)",
    title = "ATM Withdrawals"
  )

The `gg_subseries`` plot above does indicate a change in pattern on Tuesday, Wednesday, and Thursday, similar to that found in ATM1 and ATM2, but not quite as drastic and clear.

atm4_data_post0209 <- atm4_data_ts %>%
  filter_index("2010-02-10" ~ "2010-04-30")

atm4_data_post0209 %>%
  autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

atm4_dcmp <- atm4_data_post0209 %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))

atm4_dcmp %>% components() %>% autoplot()

components(atm4_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-10" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

Still too irregular. Try ARIMA and ETS and see what sort of results I can get.

Train and Test

Given I’m struggling to find a viable weekly seasonal pattern through visual inspection, I will apply the different seasonal models through the train and test exercise.

# Create training set (1 year of data)
# 292 days for training, approx. 80% of data
train_atm4 <- atm4_data_ts %>% 
  filter_index("2009-05-01" ~ "2010-02-17")

# Fit the models
fit_atm4 <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm4)
## # A tibble: 5 × 12
##   ATM   .model        sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots     MSE
##   <chr> <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>     <dbl>
## 1 ATM4  Seasonal …   2.09e+5     NA    NA    NA    NA  <NULL>   <NULL>       NA 
## 2 ATM4  Arima        1.25e+5  -2134. 4274. 4274. 4285. <cpl [7… <cpl [0…     NA 
## 3 ATM4  ETS_Add      1.05e+5  -2521. 5063. 5063. 5099. <NULL>   <NULL>   101739.
## 4 ATM4  ETS_Mult     5.14e-1  -2469. 4958. 4959. 4995. <NULL>   <NULL>   103349.
## 5 ATM4  ETS_Damp     4.88e-1  -2478. 4981. 4982. 5029. <NULL>   <NULL>   106682.
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model         .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>          <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM4  Seasonal naive Trai…   2.11    456.  347. -366.  422. 1     1     0.0620
## 2 ATM4  Arima          Trai…   0.0482  352.  295. -501.  535. 0.850 0.772 0.0819
## 3 ATM4  ETS_Add        Trai…  -9.31    319.  249. -421.  450. 0.717 0.699 0.110 
## 4 ATM4  ETS_Mult       Trai…  11.5     321.  251. -412.  444. 0.723 0.705 0.108 
## 5 ATM4  ETS_Damp       Trai… -18.3     327.  257. -434.  463. 0.741 0.716 0.0931
# Generate forecasts for 72 days
fc_atm4 <- fit_atm4 %>% forecast(h = "72 days")

fc_atm4 %>% accuracy(atm4_data_ts)
## # A tibble: 5 × 11
##   .model         ATM   .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM4  Test   -19.2   319.  266.  -777.  808. 0.768 0.699 -0.0357 
## 2 ETS_Add        ATM4  Test   -20.7   386.  319.  -918.  960. 0.919 0.847  0.0325 
## 3 ETS_Damp       ATM4  Test   -34.1   400.  330. -1017. 1059. 0.951 0.878  0.00763
## 4 ETS_Mult       ATM4  Test    -5.37  387.  321.  -913.  958. 0.925 0.849  0.00836
## 5 Seasonal naive ATM4  Test  -101.    520.  417. -1484. 1530. 1.20  1.14  -0.0989
# Plot forecasts against actual values
fc_atm4 %>%
  autoplot(filter_index(train_atm4, "2010-02-15" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(atm4_data_ts, "2010-02-15" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The above plot indicates the models are not following the weekly seasonal pattern. Let me try the truncated version similar to ATM1 and ATM2.

# ATM4 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm4 <- atm4_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-30")

train_atm4 <- atm4_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-14")

# Fit the models
fit_atm4 <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm4)
## # A tibble: 5 × 12
##   ATM   .model        sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots     MSE
##   <chr> <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>     <dbl>
## 1 ATM4  Seasonal …   2.09e+5     NA    NA    NA    NA  <NULL>   <NULL>   NA     
## 2 ATM4  Arima        1.11e+5   -419.  841.  841.  845. <cpl [0… <cpl [0… NA     
## 3 ATM4  ETS_Add      9.95e+4   -447.  913.  918.  934. <NULL>   <NULL>    8.41e4
## 4 ATM4  ETS_Mult     5.76e-1   -452.  924.  928.  944. <NULL>   <NULL>    4.57e5
## 5 ATM4  ETS_Damp     3.74e+0   -517. 1060. 1068. 1087. <NULL>   <NULL>    4.09e6
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model     .type         ME  RMSE   MAE     MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>      <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM4  Seasonal … Train… -1.05e+ 1  453.  349. -524.    578. 1     1     0.0769
## 2 ATM4  Arima      Train…  1.94e-10  330.  280. -747.    780. 0.803 0.728 0.0556
## 3 ATM4  ETS_Add    Train… -3.45e+ 1  290.  228. -204.    236. 0.652 0.640 0.0543
## 4 ATM4  ETS_Mult   Train… -1.75e+ 2  676.  391. -301.    328. 1.12  1.49  0.512 
## 5 ATM4  ETS_Damp   Train… -1.51e+ 1 2022. 1372.   -4.63  506. 3.93  4.46  0.803
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm4 <- fit_atm4 %>% forecast(h = "16 days")

fc_atm4 %>% accuracy(new_seasonal_atm4)
## # A tibble: 5 × 11
##   .model         ATM   .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM4  Test  -135.   278.  229.  -924.  937. 0.655 0.613 -0.308  
## 2 ETS_Add        ATM4  Test  -147.   274.  213.  -636.  645. 0.610 0.605  0.0327 
## 3 ETS_Damp       ATM4  Test   616.   690.  616.   670.  670. 1.77  1.52  -0.194  
## 4 ETS_Mult       ATM4  Test   -92.1  242.  189.  -618.  632. 0.542 0.534 -0.00529
## 5 Seasonal naive ATM4  Test  -244.   449.  370. -1178. 1201. 1.06  0.991  0.219
# Plot forecasts against actual values
fc_atm4 %>%
  autoplot(filter_index(train_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

fit_atm4_ets <- train_atm4 %>%
  model(ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")))

# Check the residuals of ARIMA
fit_atm4_ets %>% gg_tsresiduals()

ETS Multiplicative appears to perform best

Box-cox

# Consider Box-Cox
lambda <- new_seasonal_atm4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

new_seasonal_atm4 %>%
  autoplot(box_cox(Cash, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM4 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

fit_atm4_bc <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
    ARIMA(box_cox(Cash, lambda)),
    ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
  )

report(fit_atm4_bc)
## # A tibble: 4 × 12
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots   MSE  AMSE
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>   <dbl> <dbl>
## 1 ATM4  Seasonal… 53.7       NA    NA    NA    NA  <NULL>   <NULL>    NA    NA  
## 2 ATM4  ARIMA(bo… 23.6     -176.  360.  361.  368. <cpl [1… <cpl [0…  NA    NA  
## 3 ATM4  ETS_Add   25.0     -206.  432.  437.  453. <NULL>   <NULL>    21.1  18.5
## 4 ATM4  ETS_Mult   0.135   -216.  451.  456.  472. <NULL>   <NULL>    30.5  31.0
## # … with 1 more variable: MAE <dbl>
fit_atm4_bc %>% accuracy()
## # A tibble: 4 × 11
##   ATM   .model          .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>           <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM4  Seasonal naive  Train… -10.5  453.  349. -524.  578. 1     1      0.0769
## 2 ATM4  ARIMA(box_cox(… Train…  60.6  293.  239. -218.  257. 0.684 0.646 -0.0113
## 3 ATM4  ETS_Add         Train…  51.8  298.  218. -126.  158. 0.626 0.658  0.0641
## 4 ATM4  ETS_Mult        Train…  11.0  370.  283. -310.  348. 0.810 0.816  0.144
# Generate forecasts for the next 16 days
fc_atm4_bc <- fit_atm4_bc %>% forecast(h = "16 days")

fc_atm4_bc %>% accuracy(new_seasonal_atm4)
## # A tibble: 4 × 11
##   .model           ATM   .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>            <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(box_cox(C… ATM4  Test  -174.  321.  246. -1011. 1025. 0.705 0.708 0.380 
## 2 ETS_Add          ATM4  Test  -207.  341.  275.  -737.  747. 0.788 0.752 0.0637
## 3 ETS_Mult         ATM4  Test  -317.  459.  363.  -922.  930. 1.04  1.01  0.401 
## 4 Seasonal naive   ATM4  Test  -616.  823.  705. -2079. 2094. 2.02  1.82  0.433
# Plot forecasts against actual values
fc_atm4_bc %>%
  autoplot(filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Box-Cox was not better, so go with model without transformation

Forecast

Overall daily totals across all four ATMs

atm_data_total_by_day <- atm_data_ts %>%
  index_by(DATE) %>%
  summarize(Cash_Total = sum(Cash))

atm_data_total_by_day

atm_data_total_by_day %>% autoplot(Cash_Total)

atm_data_total_by_day %>%
  model(stl = STL(Cash_Total ~ trend(window=Inf) + season(period=7, window="periodic"))
        ) %>% 
  components() %>% autoplot()


# Seasonally adjusted plot
atm_dcmp <- atm_data_total_by_day %>%
  model(stl = STL(Cash_Total ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm_dcmp %>% components() %>% autoplot()


components(atm_dcmp) %>%
  as_tsibble() %>%
  autoplot(Cash_Total, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

Part B – Residential Customer Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")

# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'

# Display first 5 rows for visual inspection
#head(power_data_raw)

# Display summary for initial assessment
summary(power_data_raw)
##   CaseSequence      Month                KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
# Display dimensions for assessment
# should be 16 years of data 1998-2013
# Yep, 192/12 = 16
dim(power_data_raw)
## [1] 192   3
# 192   3
# KWH

power_data_ts <- power_data_raw %>%
  mutate(Month = yearmonth(Month)) %>%
  mutate(KWH = KWH/1e3) %>% # In thousands
  as_tsibble(index = Month)

# Output first 5 rows of tsibble
#head(power_data_ts)

# Inital plot of data
power_data_ts %>%
  autoplot(KWH)

# DO I NEED OR WANT THIS PLOT?
power_data_ts %>%
  filter_index("2010" ~ "2011")
## # A tsibble: 13 x 3 [1M]
##    CaseSequence    Month   KWH
##           <dbl>    <mth> <dbl>
##  1          877 2010 Jan 9397.
##  2          878 2010 Feb 8391.
##  3          879 2010 Mar 7348.
##  4          880 2010 Apr 5776.
##  5          881 2010 May 4919.
##  6          882 2010 Jun 6696.
##  7          883 2010 Jul  771.
##  8          884 2010 Aug 7923.
##  9          885 2010 Sep 7819.
## 10          886 2010 Oct 5876.
## 11          887 2010 Nov 4801.
## 12          888 2010 Dec 6153.
## 13          889 2011 Jan 8395.

First observations, data is Monthly, so I’d expect a seasonal component. 1 month is missing KWH has 1 NA value (2008 Sep) Considering imputing with median Sep Value Outlier (with very small value in July 2010) Considering imputing with median Jul Value

Impute data

#power_data_ts

# I want all the September values
sep_data <- power_data_ts %>%
  filter(str_detect(Month, "Sep"))

sep_data <- as_tibble(sep_data)

sep_kwh_med <- sep_data %>% 
  summarise(Median = median(KWH, na.rm = TRUE))

sep_kwh_med
## # A tibble: 1 × 1
##   Median
##    <dbl>
## 1  7667.
# Median 7666.97
#power_data_ts[power_data_ts$Month == "2008 Sep"] <- sep_kwh_med

power_data_ts[129, 3] <- sep_kwh_med

power_data_ts
## # A tsibble: 192 x 3 [1M]
##    CaseSequence    Month   KWH
##           <dbl>    <mth> <dbl>
##  1          733 1998 Jan 6863.
##  2          734 1998 Feb 5838.
##  3          735 1998 Mar 5421.
##  4          736 1998 Apr 5010.
##  5          737 1998 May 4665.
##  6          738 1998 Jun 6467.
##  7          739 1998 Jul 8915.
##  8          740 1998 Aug 8607.
##  9          741 1998 Sep 6990.
## 10          742 1998 Oct 6346.
## # … with 182 more rows
# I want all the July values
jul_data <- power_data_ts %>%
  filter(str_detect(Month, "Jul"))

jul_data <- as_tibble(jul_data)

jul_kwh_med <- jul_data %>% 
  summarise(Median = median(KWH, na.rm = TRUE))

jul_kwh_med
## # A tibble: 1 × 1
##   Median
##    <dbl>
## 1  7679.
# Median 7678.623   

power_data_ts[151, 3] <- jul_kwh_med

power_data_ts %>% autoplot(KWH)

Perhaps a Box-Cox is needed here, there does seem to be increasing variance

power_dcmp <- power_data_ts %>%
  model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic"))) 
power_dcmp %>% components() %>% autoplot()

components(power_dcmp) %>%
  as_tsibble() %>%
  autoplot(KWH, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "KWH (in thousands)", title = "Seasonally Adjusted Trendline")

Seasonally adjusted doesn’t look too bad

# Seasonal differencing
power_data_ts %>%
  gg_tsdisplay(difference(KWH, 12),
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")

# ARIMA model
#power_data_ts %>% gg_tsdisplay()

# https://otexts.com/fpp3/seasonal-arima.html
arima_mod_power <- power_data_ts %>% model(ARIMA(KWH, stepwise=F, approx=F))

report(arima_mod_power)
## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.3444  0.0281  0.2310  -0.6994  -0.4180  228.6842
## s.e.  0.0786  0.0892  0.0742   0.0779   0.0784   75.9032
## 
## sigma^2 estimated as 379089:  log likelihood=-1412.61
## AIC=2839.22   AICc=2839.87   BIC=2861.57
arima_mod_power %>% gg_tsresiduals(lag=36)

forecast(arima_mod_power, h=36) %>%
  autoplot(power_data_ts) +
  labs(title = "KWH Used",
       y="KWH (in thousands)")

Train and Test Model

# Train on 154 months, which is 12 years and 8 months (80% of the total)
# Total is 192 months (16 years)

train_power_data <- power_data_ts %>% 
  filter_index("1998-Jan" ~ "2010-Oct")

Box-Cox

lambda <- power_data_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

train_power_data %>%
  autoplot(box_cox(KWH, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM1 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

# Fit the models
fit_power <- train_power_data %>%
  model(
    Naive = NAIVE(box_cox(KWH, lambda)),
    `Seasonal naive` = SNAIVE(box_cox(KWH, lambda)),
    `Random walk` = RW(box_cox(KWH, lambda)),
    Arima_BC = ARIMA(box_cox(KWH, lambda)),
    Arima = ARIMA(KWH)
  )

fit_power$Arima_BC
## <lst_mdl[1]>
## [1] <ARIMA(1,0,1)(0,1,1)[12] w/ drift>
fit_power$Arima
## <lst_mdl[1]>
## [1] <ARIMA(0,0,3)(0,1,1)[12] w/ drift>
#fit_atm1_snaive <- train_atm1 %>%
#  model(SNAIVE(Cash))

fit_power %>% accuracy()
## # A tibble: 5 × 10
##   .model         .type        ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Naive          Training  -6.45 1289. 1069. -2.21  17.2  1.92  1.80   0.182  
## 2 Seasonal naive Training  69.0   715.  557.  0.478  8.79 1     1      0.267  
## 3 Random walk    Training  -6.45 1289. 1069. -2.21  17.2  1.92  1.80   0.182  
## 4 Arima_BC       Training   7.10  518.  397. -0.309  6.34 0.713 0.725  0.0885 
## 5 Arima          Training -11.7   499.  383. -0.687  6.16 0.688 0.698 -0.00393
# Check the residuals of Seasonal Naive
#fit_atm1_snaive %>% gg_tsresiduals()

# Generate forecasts for 38 months, so we'll see if it picks up the seasonal nature
fc_power <- fit_power %>% forecast(h = "38 months")

fc_power %>% accuracy(power_data_ts)
## # A tibble: 5 × 10
##   .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima          Test    661. 1068.  777.   7.64  9.57  1.40  1.49 0.224
## 2 Arima_BC       Test    621. 1034.  757.   7.25  9.42  1.36  1.45 0.229
## 3 Naive          Test  -1348. 2580. 2155. -23.4  32.0   3.87  3.61 0.689
## 4 Random walk    Test  -1348. 2580. 2155. -23.4  32.0   3.87  3.61 0.689
## 5 Seasonal naive Test    303. 1018.  850.   2.77 11.0   1.53  1.42 0.419
# Plot forecasts against actual values
fc_power %>%
  autoplot(power_data_ts, level = NULL) +
  autolayer(
    filter_index(power_data_ts, "1998-Jan" ~ "2010-Oct"),
    colour = "black"
  ) +
  labs(
    y = "KWH (in thousands)",
    title = "Power Forecast"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

# Result: Not good, SNAIVE at least attempts the seasonal nature

ARIMA_BC is the best RMSE Need to try the ETS

Part C – Waterflow (optional)

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

# Read in data
pipe1_data_raw <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))

# From: https://stackoverflow.com/questions/53635818/convert-datetime-from-excel-to-r

pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`,
                              origin="1899-12-30",
                              tz="GMT")

pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`,
                              origin="1899-12-30",
                              tz="GMT")


# Initial output to see data
head(pipe1_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 00:24:06     23.4 
## 2 2015-10-23 00:40:02     28.0 
## 3 2015-10-23 00:53:51     23.1 
## 4 2015-10-23 00:55:40     30.0 
## 5 2015-10-23 01:19:17      6.00
## 6 2015-10-23 01:23:58     15.9
head(pipe2_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 01:00:00      18.8
## 2 2015-10-23 02:00:00      43.1
## 3 2015-10-23 03:00:00      38.0
## 4 2015-10-23 04:00:00      36.1
## 5 2015-10-23 05:00:00      31.9
## 6 2015-10-23 06:00:00      28.2
summary(pipe1_data_raw)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 00:24:06   Min.   : 1.067  
##  1st Qu.:2015-10-25 11:21:35   1st Qu.:13.683  
##  Median :2015-10-27 20:07:30   Median :19.880  
##  Mean   :2015-10-27 20:49:15   Mean   :19.897  
##  3rd Qu.:2015-10-30 08:24:51   3rd Qu.:26.159  
##  Max.   :2015-11-01 23:35:43   Max.   :38.913
summary(pipe2_data_raw)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303
dim(pipe1_data_raw)
## [1] 1000    2
dim(pipe2_data_raw)
## [1] 1000    2

Define as tsibble

atm_data_ts <- atm_data_raw %>% as_tsibble(index = DATE, key = ATM)

atm_data_ts ```

#30#